Exercises and associated data

The data and modelling objects created in this notebook can be downloaded directly to save computational time.


Users who wish to complete the exercises can download a small template R script. Assuming you have already downloaded the data objects above, this script will load all data objects so that the steps used to create them are not necessary to tackle the exercises.

Load libraries and time series data

This tutorial relates to content covered in Lecture 2 and Lecture 3, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:

library(dplyr)
library(mvgam) 
library(brms)
library(marginaleffects)
library(tidybayes)
library(ggplot2)

We will begin by continuing our work with time series of rodent captures from the Portal Project, a long-term monitoring study based near the town of Portal, Arizona, with particular focus on the time series of captures of the Desert Pocket Mouse Chaetodipus penicillatus. Data manipulations will proceed in a similar way as in Tutorial 1, except that we now use a lagged version of ndvi (at a lag of 12 lunar months) in place of the contemporaneous values of ndvi (see ?dplyr::lag for details), to give you a sense of how you can create and use lagged predictors in time series models.
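Before running the full pipeline below, the behaviour of a lagged predictor can be illustrated with a minimal base-R sketch (lag_vec is a hypothetical helper mimicking dplyr::lag, written here purely for illustration):

```r
# A hypothetical helper that mimics dplyr::lag(x, n): shift a vector
# forward by n steps, padding the start with NA
lag_vec <- function(x, n) {
  c(rep(NA, n), x[seq_len(length(x) - n)])
}

ndvi_example <- c(1.2, 1.5, 1.1, 0.9, 1.3)
lag_vec(ndvi_example, 2)
# the first 2 values are NA; the rest are the original series shifted
```

Note that this only works if the rows are sorted by time first, which is why the pipeline below calls dplyr::arrange(time) before computing the lag.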

data("portal_data")
portal_data %>%
  
  # mvgam requires a 'time' variable be present in the data to index
  # the temporal observations. This is especially important when tracking 
  # multiple time series. In the Portal data, the 'moon' variable indexes the
  # lunar monthly timestep of the trapping sessions
  dplyr::mutate(time = moon - (min(moon)) + 1) %>%
  
  # We can also provide a more informative name for the outcome variable, which 
  # is counts of the 'PP' species (Chaetodipus penicillatus) across all control
  # plots
  dplyr::mutate(count = PP) %>%
  
  # The other requirement for mvgam is a 'series' variable, which needs to be a
  # factor variable to index which time series each row in the data belongs to.
  # Again, this is more useful when you have multiple time series in the data
  dplyr::mutate(series = as.factor('PP')) %>%
  
  # Create a lagged version of ndvi to use as a predictor in place of the 
  # contemporaneous ndvi we have been using previously. Note, the data must be 
  # in the correct order for this to work properly
  dplyr::arrange(time) %>%
  dplyr::mutate(ndvi_lag12 = dplyr::lag(ndvi, 12)) %>%
  
  # Select the variables of interest to keep in the model_data
  dplyr::select(series, time, count,
                mintemp, ndvi_lag12) -> model_data

The data now contain five variables:
series, a factor indexing which time series each observation belongs to
time, the indicator of which time step each observation belongs to
count, the response variable representing the number of captures of the species PP in each sampling observation
mintemp, the monthly average minimum temperature at each time step
ndvi_lag12, the lagged monthly average Normalized Difference Vegetation Index at each time step

Using lagged versions of predictors is standard practice in time series analysis and forecasting. Unfortunately, the software we are using cannot easily handle missing values in the predictors. Because we have calculated a 12-month lag of ndvi, the first 12 rows of our data now have missing values for this predictor:

head(model_data, 14)
##    series time count mintemp ndvi_lag12
## 1      PP    1     0  -9.710         NA
## 2      PP    2     1  -5.924         NA
## 3      PP    3     2  -0.220         NA
## 4      PP    4    NA   1.931         NA
## 5      PP    5    10   6.568         NA
## 6      PP    6    NA  11.590         NA
## 7      PP    7    NA  14.370         NA
## 8      PP    8    16  16.520         NA
## 9      PP    9    18  12.490         NA
## 10     PP   10    12   9.210         NA
## 11     PP   11    NA   0.845         NA
## 12     PP   12     3  -7.080         NA
## 13     PP   13     2  -7.550   1.465889
## 14     PP   14    NA  -3.633   1.558507
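Beyond eyeballing head(), a quick base-R idiom for counting missing values per column works on any data frame (the example data here are made up):

```r
# example data frame with some missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# number of NAs in each column
colSums(is.na(df))
```

Applied to model_data, this would report 12 NAs for ndvi_lag12 plus any missing counts in the response.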

Visualize the data as a heatmap to get a sense of where these missing values are distributed (NAs are shown as red bars in the plot below):

image(is.na(t(model_data %>%
                dplyr::arrange(dplyr::desc(time)))), axes = F,
      col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))

For now we will simply omit these missing values and restart the time index at 1:

model_data %>%
  dplyr::filter(!is.na(ndvi_lag12)) %>%
  dplyr::mutate(time = time - min(time) + 1) -> model_data_trimmed
head(model_data_trimmed)

As in the previous tutorial, we split the data into training and testing subsets so we can inspect forecast behaviours, this time using a shorter validation period:

model_data_trimmed %>% 
  dplyr::filter(time <= 162) -> data_train 
model_data_trimmed %>% 
  dplyr::filter(time > 162) -> data_test

Forecasting with temporal smooths

Our last tutorial ended with us fitting an mvgam model with a smooth function of time to capture temporal variation. The lectures have illustrated why this might be a bad idea if one of our goals is to produce useful forecasts. We will produce forecasts from this model to show why this can be problematic. Start by refitting the model from Tutorial 1, including the data_test object as newdata to ensure the model automatically produces conditional Bayesian forecasts.

model0 <- mvgam(count ~ s(time, bs = 'bs', k = 15) + 
                  ndvi_lag12 + 
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test)

Now inspect forecasts from this model, which are not great

plot(model0, type = 'forecast', newdata = data_test)

## Out of sample DRPS:
## [1] 328.8323

But what happens if we change k slightly by using 12 basis functions rather than 15?

model0b <- mvgam(count ~ s(time, bs = 'bs', k = 12) + 
                  ndvi_lag12 + 
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test)

Forecasts from this model are even more ridiculous

plot(model0b, type = 'forecast', newdata = data_test)

## Out of sample DRPS:
## [1] 7110

Why is this happening? The forecasts are driven almost entirely by variation in the temporal spline, which is extrapolating linearly forever beyond the edge of the training data. Any slight wiggles near the end of the training set will result in wildly different forecasts. To visualize this, we can plot the extrapolated temporal functions into the out-of-sample test set for the two models. Here are the extrapolated functions for the first model, with 15 basis functions:
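This sensitivity can be reproduced in isolation with a small mgcv sketch on simulated data (all names and values here are illustrative; the default thin plate basis is used for simplicity, but penalized splines generally show the same extrapolation sensitivity):

```r
library(mgcv)

set.seed(42)
time <- 1:100
dat <- data.frame(time = time,
                  y = rpois(100, exp(1 + 0.5 * sin(time / 10))))

# identical models except for the basis dimension k
fit_k15 <- gam(y ~ s(time, k = 15), family = poisson(), data = dat)
fit_k12 <- gam(y ~ s(time, k = 12), family = poisson(), data = dat)

# predict well beyond the training range
newdat <- data.frame(time = 101:130)
p15 <- predict(fit_k15, newdata = newdat)
p12 <- predict(fit_k12, newdata = newdat)

# the two extrapolations typically disagree, and the disagreement
# grows with distance from the end of the training data
range(p15 - p12)
```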

plot_mvgam_smooth(model0, smooth = 's(time)',
                  # feed newdata to the plot function to generate
                  # predictions of the temporal smooth to the end of the 
                  # testing period
                  newdata = data.frame(time = 1:max(data_test$time),
                                       mintemp = 0,
                                       ndvi_lag12 = 0))
abline(v = max(data_train$time), lty = 'dashed', lwd = 2)

And now for the second model, with slightly fewer basis functions (12). The difference is clear: this model reacts to a slight upward shift in the smooth toward the end of the training period, producing predictions that head towards infinity:

plot_mvgam_smooth(model0b, smooth = 's(time)',
                  newdata = data.frame(time = 1:max(data_test$time),
                                       mintemp = 0,
                                       ndvi_lag12 = 0))
abline(v = max(data_train$time), lty = 'dashed', lwd = 2)

Try changing the value of k and see how the resulting forecasts change. This should give you more confidence that extrapolating from penalized smooth functions can be unpredictable and very difficult to control. There are some strategies to help rein in this behaviour (once again this advice comes via a blog post by Gavin Simpson), but results will still be unpredictable and very sensitive to particular choices such as k or the placement of knots. But if we shouldn't use smooth functions of time for forecasting, what other options do we have?

Exercises

  1. Using advice from Gavin Simpson’s blog post above, try changing the order of the penalty in the temporal smooth to see how the resulting predictions change (use ?b.spline for guidance)
  2. Try using a cubic regression spline in place of the b-spline and inspect the resulting predictions (use bs = 'cr')
Check here for template code if you’re having trouble changing the penalty order in the b-spline
# Replace the ? with the correct value(s)
# Using ?b.spline you will see examples in the Description and Details sections about how these splines can handle multiple penalties
model0 <- mvgam(count ~ s(time, bs = 'bs', k = 15,
                          m = c(?)) + 
                  ndvi_lag12 + 
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test)

Residual correlation structures

Another useful blog post by Gavin Simpson on strategies for fitting nonlinear interactive effects with GAMs to model changing seasonal patterns in time series addresses the possibility of fitting models that explicitly allow for autocorrelated errors. If our model’s residuals are autocorrelated, our estimates of uncertainty around other regression parameters (such as the \(\beta\) coefficients) can be biased and misleading. Failing to address this problem may lead to false inferences and perhaps to poor forecasts. Fortunately there are ways we can tackle this by explicitly modelling the autocorrelation in our model’s errors. This can be done using mgcv functionality, as Gavin explains in his blog post above. But we will use brms here as it is more flexible and provides an interface that is more consistent with mvgam.
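To build intuition for what autocorrelated errors look like, here is a small base-R simulation (purely illustrative) of an AR1 error process and its sample autocorrelation:

```r
set.seed(123)

# simulate an AR1 error process: z_t = 0.8 * z_{t-1} + noise
z <- arima.sim(model = list(ar = 0.8), n = 500)

# the lag-1 sample autocorrelation should be close to the
# true AR coefficient of 0.8
acf(z, plot = FALSE)$acf[2]
```

If a regression model ignores this structure, the correlation stays in the residuals, and the effective number of independent observations is much smaller than the nominal sample size, which is why uncertainty estimates become overconfident.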

A standard Poisson GLM in brms

First, we will fit a model similar to those used in Tutorial 1 to see how the brms interface works. You will notice that the interface is very familiar, relying on the usual R formula syntax (see ?brmsformula for more details and examples). But there are some key differences between brms and mvgam. For example, brms offers more options for setting prior distributions (see ?get_prior and ?set_prior for details) and can handle a much larger diversity of response distributions (see ?brmsfamily for details). For now, we will stick with the Poisson distribution and use the same predictors as previously (a nonlinear function of ndvi_lag12 and a parametric, linear function of mintemp):

model1 <- brm(count ~ 
                s(ndvi_lag12, k = 9) +
                mintemp,
              family = poisson(),
              # set a mildly informative prior on beta coefficients
              prior = c(set_prior("normal(0, 2)", class = "b")),
              data = data_train,
              backend = 'cmdstanr',
              iter = 1000,
              cores = 4)

As before, the first step after fitting a model in brms is to inspect the summary to ensure no major diagnostic warnings have been produced and to inspect posterior distributions for key parameters. You may notice some warnings of divergent transitions for this model, which should give you immediate concern about the validity of the model's inferences:

summary(model1)
## Warning: There were 28 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
##  Family: poisson 
##   Links: mu = log 
## Formula: count ~ s(ndvi_lag12, k = 9) + mintemp 
##    Data: data_train (Number of observations: 134) 
##   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup draws = 2000
## 
## Smooth Terms: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sndvi_lag12_1)     0.29      0.21     0.02     0.82 1.01      565     1027
## 
## Population-Level Effects: 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept         2.39      0.03     2.33     2.45 1.01     1423     1239
## mintemp           0.07      0.00     0.06     0.07 1.00     1199     1431
## sndvi_lag12_1    -0.12      0.58    -1.09     1.60 1.02      223       51
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
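As the warning itself suggests, one common first response to divergent transitions is to raise adapt_delta, which in brms is passed through the control argument. A sketch of such a refit (not run here, and not guaranteed to remove all divergences on its own):

```r
# refit with a higher target acceptance rate; the sampler takes
# smaller, more careful steps, which is slower but often reduces
# the number of divergent transitions
model1b <- brm(count ~ s(ndvi_lag12, k = 9) + mintemp,
               family = poisson(),
               prior = c(set_prior("normal(0, 2)", class = "b")),
               data = data_train,
               backend = 'cmdstanr',
               iter = 1000,
               cores = 4,
               control = list(adapt_delta = 0.95))
```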

For any brms model, we can inspect the underlying Stan code using the stancode function:

stancode(model1)
Check here to see the model’s Stan code
## // generated with brms 2.19.0
## functions {
##   
## }
## data {
##   int<lower=1> N; // total number of observations
##   array[N] int Y; // response variable
##   int<lower=1> K; // number of population-level effects
##   matrix[N, K] X; // population-level design matrix
##   // data for splines
##   int Ks; // number of linear effects
##   matrix[N, Ks] Xs; // design matrix for the linear effects
##   // data for spline s(ndvi_lag12, k = 9)
##   int nb_1; // number of bases
##   array[nb_1] int knots_1; // number of knots
##   // basis function matrices
##   matrix[N, knots_1[1]] Zs_1_1;
##   int prior_only; // should the likelihood be ignored?
## }
## transformed data {
##   int Kc = K - 1;
##   matrix[N, Kc] Xc; // centered version of X without an intercept
##   vector[Kc] means_X; // column means of X before centering
##   for (i in 2 : K) {
##     means_X[i - 1] = mean(X[ : , i]);
##     Xc[ : , i - 1] = X[ : , i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b; // population-level effects
##   real Intercept; // temporary intercept for centered predictors
##   vector[Ks] bs; // spline coefficients
##   // parameters for spline s(ndvi_lag12, k = 9)
##   // standarized spline coefficients
##   vector[knots_1[1]] zs_1_1;
##   real<lower=0> sds_1_1; // standard deviations of spline coefficients
## }
## transformed parameters {
##   // actual spline coefficients
##   vector[knots_1[1]] s_1_1;
##   real lprior = 0; // prior contributions to the log posterior
##   // compute actual spline coefficients
##   s_1_1 = sds_1_1 * zs_1_1;
##   lprior += normal_lpdf(b | 0, 2);
##   lprior += student_t_lpdf(Intercept | 3, 2.6, 2.5);
##   lprior += normal_lpdf(bs | 0, 2);
##   lprior += student_t_lpdf(sds_1_1 | 3, 0, 2.5)
##             - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## model {
##   // likelihood including constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = rep_vector(0.0, N);
##     mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
##     target += poisson_log_glm_lpmf(Y | Xc, mu, b);
##   }
##   // priors including constants
##   target += lprior;
##   target += std_normal_lpdf(zs_1_1);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }
## 


Of course we still cannot interpret the smooth effects by simply inspecting the posterior summary; we will need to make targeted plots to do that. But a major advantage of using brms for regression models is the ability to quickly compute and visualize conditional or marginal effects on the scale of the outcome variable. Even when we use linear (parametric) terms in regressions, the relationship between a predictor and the outcome can be hard to understand because of nonlinearities induced by the link function. The plot below shows such a pattern for the estimated effect of minimum temperature (mintemp) on counts of rodents. Note that plotting from brms usually returns a ggplot object that can be further manipulated. I show some possible manipulations here by changing the colours of the functions and adding a rug at the bottom of each plot to show the distributions of observed covariate values:

plot(conditional_effects(model1),
     line_args = list(colour = "#A25050",
                      fill = "#A2505060"),
     rug_args = list(colour = "#A25050"),
     theme = theme_classic(),
     mean = FALSE,
     rug = TRUE)

Computing residuals with tidybayes

A notable difference between brms and mvgam is that the former does not automatically return posterior predictions; these need to be computed after model fitting, which means that residuals also have to be computed manually. To inspect residual diagnostics from brms models, we can enlist the help of the tidybayes package. More information on this procedure can be found in this tidybayes vignette. Do not worry if this procedure is difficult to understand; it is not of major relevance to the learning objectives of the course (and if you use mvgam, this will be done for you automatically anyway). But it is useful to know where you can go for more information on these residuals. Expect to see some warnings here, as some of the observations have missing values:

data_train %>%
  # calculate posterior predictions
  add_predicted_draws(model1) %>%
  # use predictions to compute randomized quantile residuals
  summarise(
    p_lower = mean(.prediction < count),
    p_upper = mean(.prediction <= count),
    p_residual = runif(1, 
                       max(p_lower, 0.00001), 
                       min(p_upper, 0.99999)),
    z_residual = qnorm(p_residual),
    .groups = "drop_last"
  ) -> residual_data
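The logic of these randomized quantile (Dunn-Smyth) residuals can be sketched in base R for a single observation (the draws and observed value here are made up):

```r
set.seed(1)

# hypothetical posterior predictive draws for one observation
pred_draws <- rpois(2000, lambda = 8)
observed <- 5

# probability bounds for the observed count under the predictions
p_lower <- mean(pred_draws < observed)
p_upper <- mean(pred_draws <= observed)

# draw a uniform value between the bounds and map it to a standard
# normal quantile; a well-calibrated model yields residuals that
# look approximately N(0, 1)
p_residual <- runif(1, max(p_lower, 1e-5), min(p_upper, 1 - 1e-5))
z_residual <- qnorm(p_residual)
z_residual
```

The uniform draw between the two probability bounds is what handles the discreteness of count data, which is why these residuals are "randomized".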

With residuals now computed, we can perform the usual residual diagnostics. For example, below are residual ACF, pACF and Q-Q plots, all of which show that the model’s predictions do not adequately match the features of the observed data

acf(residual_data$z_residual, 
    na.action = na.pass, 
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = 'Dunn-Smyth residuals')

pacf(residual_data$z_residual, 
    na.action = na.pass, 
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = 'Dunn-Smyth residuals')

ggplot(residual_data, aes(sample = z_residual)) +
  geom_qq() +
  geom_abline(col = 'darkred') +
  theme_classic()
## Warning: Removed 51 rows containing non-finite values (stat_qq).

Autocorrelated error GLM in brms

This model is not doing well. Clearly we need to account for the strong temporal autocorrelation when modelling these data without using a smooth function of time. Now onto another great feature of brms: the ability to include (possibly latent) autocorrelated residuals in regression models. To do so, we use the ar function (see ?brms::ar for details). This model will use a separate sub-model for latent residuals that evolve as an AR1 process (i.e. the error at the current time point is a function of the error at the previous time point, plus some stochastic noise). When using non-Gaussian observations, we can only select an order of 1 for the AR component (higher orders are currently only allowed for Gaussian observations). Note that this model takes considerably longer to fit than the previous model because it estimates the full temporal covariance function for the residuals:

model2 <- brm(count ~ 
                s(ndvi_lag12, k = 9) +
                mintemp +
                ar(p = 1, 
                   time = time,
                   cov = TRUE),
              family = poisson(),
              prior = c(set_prior("normal(0, 2)", class = "b")),
              data = data_train,
              backend = 'cmdstanr',
              iter = 1000,
              cores = 4)

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ \log(\lambda_t) & = f(\boldsymbol{ndvi})_t + \beta_{mintemp} * \boldsymbol{mintemp}_t + z_t \\ z_t & \sim \text{Normal}(ar1 * z_{t-1}, \sigma_{error}) \\ ar1 & \sim \text{Normal}(0, 1)[-1, 1] \\ \sigma_{error} & \sim \text{Exponential}(2) \\ f(\boldsymbol{ndvi})_t & = \sum_{k=1}^{K} b_k(\boldsymbol{ndvi}_t) * \beta_{smooth,k} \\ \boldsymbol{\beta}_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{mintemp} & \sim \text{Normal}(0, 1) \end{align*}\]

Here the term \(z_t\) captures autocorrelated latent residuals, which are modelled using an AR1 process. In brms, the AR coefficients are restricted to the interval \([-1,1]\) to ensure the dynamic process is stationary. View the model’s summary to ensure sampler diagnostics look good. Notice how there are now summaries for the residual autoregressive parameters as well

summary(model2)
## Warning: There were 22 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
##  Family: poisson 
##   Links: mu = log 
## Formula: count ~ s(ndvi_lag12, k = 9) + mintemp + ar(p = 1, time = time, cov = TRUE) 
##    Data: data_train (Number of observations: 134) 
##   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup draws = 2000
## 
## Smooth Terms: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sndvi_lag12_1)     0.45      0.43     0.01     1.58 1.01      671      815
## 
## Correlation Structures:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## ar[1]     0.79      0.06     0.68     0.91 1.01      593      968
## sderr     0.65      0.06     0.53     0.78 1.01      627     1046
## 
## Population-Level Effects: 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept         1.84      0.32     1.21     2.53 1.01      583      628
## mintemp           0.09      0.01     0.06     0.11 1.00      732      959
## sndvi_lag12_1     0.22      0.89    -1.84     1.92 1.01      976     1007
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
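The stationary AR1 correlation structure implied by cov = TRUE can be sketched in base R: the correlation between residuals at two time points decays geometrically with their separation, ar1^|i - j|. Here is a small illustration using the posterior mean of ar[1] reported above:

```r
# build an n x n AR1 correlation matrix: cor(z_i, z_j) = ar1^|i - j|
ar1_cor <- function(ar1, n) {
  ar1 ^ abs(outer(seq_len(n), seq_len(n), "-"))
}

R <- ar1_cor(0.79, 5)
round(R, 2)

# the matrix is symmetric with unit diagonal, and its Cholesky
# factor is what scales the unscaled residuals inside Stan
L <- chol(R)
```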

Again you can inspect the Stan code to see how the autocorrelation structure is included in the linear predictor for \(\mu\). This is done using the custom brms function scale_time_err, which estimates correlations among residuals for all time points at once. This is fairly expensive computationally, which is why the model is considerably slower than any of the other models we’ve used so far

stancode(model2)
Check here to see the model’s Stan code
## // generated with brms 2.19.0
## functions {
##   /* integer sequence of values
##    * Args:
##    *   start: starting integer
##    *   end: ending integer
##    * Returns:
##    *   an integer sequence from start to end
##    */
##   array[] int sequence(int start, int end) {
##     array[end - start + 1] int seq;
##     for (n in 1 : num_elements(seq)) {
##       seq[n] = n + start - 1;
##     }
##     return seq;
##   }
##   // are two 1D integer arrays equal?
##   int is_equal(array[] int a, array[] int b) {
##     int n_a = size(a);
##     int n_b = size(b);
##     if (n_a != n_b) {
##       return 0;
##     }
##     for (i in 1 : n_a) {
##       if (a[i] != b[i]) {
##         return 0;
##       }
##     }
##     return 1;
##   }
##   
##   /* grouped data stored linearly in "data" as indexed by begin and end
##    * is repacked to be stacked into an array of vectors.
##    */
##   array[] vector stack_vectors(vector long_data, int n, array[] int stack,
##                                array[] int begin, array[] int end) {
##     int S = sum(stack);
##     int G = size(stack);
##     array[S] vector[n] stacked;
##     int j = 1;
##     for (i in 1 : G) {
##       if (stack[i] == 1) {
##         stacked[j] = long_data[begin[i] : end[i]];
##         j += 1;
##       }
##     }
##     return stacked;
##   }
##   /* scale and correlate time-series residuals
##    * using the Cholesky factor of the correlation matrix
##    * Args:
##    *   zerr: standardized and independent residuals
##    *   sderr: standard deviation of the residuals
##    *   chol_cor: cholesky factor of the correlation matrix
##    *   nobs: number of observations in each group
##    *   begin: the first observation in each group
##    *   end: the last observation in each group
##    * Returns:
##    *   vector of scaled and correlated residuals
##    */
##   vector scale_time_err(vector zerr, real sderr, matrix chol_cor,
##                         array[] int nobs, array[] int begin, array[] int end) {
##     vector[rows(zerr)] err;
##     for (i in 1 : size(nobs)) {
##       matrix[nobs[i], nobs[i]] L_i;
##       L_i = sderr * chol_cor[1 : nobs[i], 1 : nobs[i]];
##       err[begin[i] : end[i]] = L_i * zerr[begin[i] : end[i]];
##     }
##     return err;
##   }
##   /* scale and correlate time-series residuals
##    * allows for flexible correlation matrix subsets
##    * Deviating Args:
##    *   Jtime: array of time indices per group
##    * Returns:
##    *   vector of scaled and correlated residuals
##    */
##   vector scale_time_err_flex(vector zerr, real sderr, matrix chol_cor,
##                              array[] int nobs, array[] int begin,
##                              array[] int end, array[,] int Jtime) {
##     vector[rows(zerr)] err;
##     int I = size(nobs);
##     array[I] int has_err = rep_array(0, I);
##     int i = 1;
##     matrix[rows(chol_cor), cols(chol_cor)] L;
##     matrix[rows(chol_cor), cols(chol_cor)] Cov;
##     L = sderr * chol_cor;
##     Cov = multiply_lower_tri_self_transpose(L);
##     while (i <= I) {
##       array[nobs[i]] int iobs = Jtime[i, 1 : nobs[i]];
##       matrix[nobs[i], nobs[i]] L_i;
##       if (is_equal(iobs, sequence(1, rows(L)))) {
##         // all timepoints are present in this group
##         L_i = L;
##       } else {
##         // arbitrary subsets cannot be taken on L directly
##         L_i = cholesky_decompose(Cov[iobs, iobs]);
##       }
##       err[begin[i] : end[i]] = L_i * zerr[begin[i] : end[i]];
##       has_err[i] = 1;
##       // find all additional groups where we have the same timepoints
##       for (j in (i + 1) : I) {
##         if (has_err[j] == 0 && is_equal(Jtime[j], Jtime[i]) == 1) {
##           err[begin[j] : end[j]] = L_i * zerr[begin[j] : end[j]];
##           has_err[j] = 1;
##         }
##       }
##       while (i <= I && has_err[i] == 1) {
##         i += 1;
##       }
##     }
##     return err;
##   }
##   /* compute the cholesky factor of an AR1 correlation matrix
##    * Args:
##    *   ar: AR1 autocorrelation
##    *   nrows: number of rows of the covariance matrix
##    * Returns:
##    *   A nrows x nrows matrix
##    */
##   matrix cholesky_cor_ar1(real ar, int nrows) {
##     matrix[nrows, nrows] mat;
##     vector[nrows - 1] gamma;
##     mat = diag_matrix(rep_vector(1, nrows));
##     for (i in 2 : nrows) {
##       gamma[i - 1] = pow(ar, i - 1);
##       for (j in 1 : (i - 1)) {
##         mat[i, j] = gamma[i - j];
##         mat[j, i] = gamma[i - j];
##       }
##     }
##     return cholesky_decompose(mat ./ (1 - ar ^ 2));
##   }
##   /* compute the cholesky factor of a MA1 correlation matrix
##    * Args:
##    *   ma: MA1 autocorrelation
##    *   nrows: number of rows of the covariance matrix
##    * Returns:
##    *   A nrows x nrows MA1 covariance matrix
##    */
##   matrix cholesky_cor_ma1(real ma, int nrows) {
##     matrix[nrows, nrows] mat;
##     mat = diag_matrix(rep_vector(1 + ma ^ 2, nrows));
##     if (nrows > 1) {
##       mat[1, 2] = ma;
##       for (i in 2 : (nrows - 1)) {
##         mat[i, i - 1] = ma;
##         mat[i, i + 1] = ma;
##       }
##       mat[nrows, nrows - 1] = ma;
##     }
##     return cholesky_decompose(mat);
##   }
##   /* compute the cholesky factor of an ARMA1 correlation matrix
##    * Args:
##    *   ar: AR1 autocorrelation
##    *   ma: MA1 autocorrelation
##    *   nrows: number of rows of the covariance matrix
##    * Returns:
##    *   A nrows x nrows matrix
##    */
##   matrix cholesky_cor_arma1(real ar, real ma, int nrows) {
##     matrix[nrows, nrows] mat;
##     vector[nrows] gamma;
##     mat = diag_matrix(rep_vector(1 + ma ^ 2 + 2 * ar * ma, nrows));
##     gamma[1] = (1 + ar * ma) * (ar + ma);
##     for (i in 2 : nrows) {
##       gamma[i] = gamma[1] * pow(ar, i - 1);
##       for (j in 1 : (i - 1)) {
##         mat[i, j] = gamma[i - j];
##         mat[j, i] = gamma[i - j];
##       }
##     }
##     return cholesky_decompose(mat ./ (1 - ar ^ 2));
##   }
## }
## data {
##   int<lower=1> N; // total number of observations
##   array[N] int Y; // response variable
##   int<lower=1> K; // number of population-level effects
##   matrix[N, K] X; // population-level design matrix
##   // data for splines
##   int Ks; // number of linear effects
##   matrix[N, Ks] Xs; // design matrix for the linear effects
##   // data for spline s(ndvi_lag12, k = 9)
##   int nb_1; // number of bases
##   array[nb_1] int knots_1; // number of knots
##   // basis function matrices
##   matrix[N, knots_1[1]] Zs_1_1;
##   // data needed for ARMA correlations
##   int<lower=0> Kar; // AR order
##   int<lower=0> Kma; // MA order
##   // see the functions block for details
##   int<lower=1> N_tg;
##   array[N_tg] int<lower=1> begin_tg;
##   array[N_tg] int<lower=1> end_tg;
##   array[N_tg] int<lower=1> nobs_tg;
##   int prior_only; // should the likelihood be ignored?
## }
## transformed data {
##   int Kc = K - 1;
##   matrix[N, Kc] Xc; // centered version of X without an intercept
##   vector[Kc] means_X; // column means of X before centering
##   int max_lag = max(Kar, Kma);
##   int max_nobs_tg = max(nobs_tg); // maximum dimension of the autocorrelation matrix
##   for (i in 2 : K) {
##     means_X[i - 1] = mean(X[ : , i]);
##     Xc[ : , i - 1] = X[ : , i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b; // population-level effects
##   real Intercept; // temporary intercept for centered predictors
##   vector[Ks] bs; // spline coefficients
##   // parameters for spline s(ndvi_lag12, k = 9)
##   // standarized spline coefficients
##   vector[knots_1[1]] zs_1_1;
##   real<lower=0> sds_1_1; // standard deviations of spline coefficients
##   vector[N] zerr; // unscaled residuals
##   real<lower=0> sderr; // SD of residuals
##   vector<lower=-1, upper=1>[Kar] ar; // autoregressive coefficients
## }
## transformed parameters {
##   // actual spline coefficients
##   vector[knots_1[1]] s_1_1;
##   vector[N] err; // actual residuals
##   // cholesky factor of the autocorrelation matrix
##   matrix[max_nobs_tg, max_nobs_tg] Lcortime;
##   real lprior = 0; // prior contributions to the log posterior
##   // compute actual spline coefficients
##   s_1_1 = sds_1_1 * zs_1_1;
##   // compute residual covariance matrix
##   Lcortime = cholesky_cor_ar1(ar[1], max_nobs_tg);
##   // compute correlated time-series residuals
##   err = scale_time_err(zerr, sderr, Lcortime, nobs_tg, begin_tg, end_tg);
##   lprior += normal_lpdf(b | 0, 2);
##   lprior += student_t_lpdf(Intercept | 3, 2.6, 2.5);
##   lprior += normal_lpdf(bs | 0, 2);
##   lprior += student_t_lpdf(sds_1_1 | 3, 0, 2.5)
##             - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   lprior += student_t_lpdf(sderr | 3, 0, 2.5)
##             - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## model {
##   // likelihood including constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = rep_vector(0.0, N);
##     mu += Intercept + Xc * b + Xs * bs + Zs_1_1 * s_1_1 + err;
##     target += poisson_log_lpmf(Y | mu);
##   }
##   // priors including constants
##   target += lprior;
##   target += std_normal_lpdf(zs_1_1);
##   target += std_normal_lpdf(zerr);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }
## 


Autocorrelated error GLM in mvgam

If you have a look at the time indices used for estimating the residual AR1 process (hint: look at standata(model2)), you will see that this model estimates autocorrelated errors over the full time period, even though some of these time points have missing observations. This is useful for getting more realistic estimates of the residual autocorrelation parameters. mvgam does the same when using a latent AR1 trend model, though it is considerably faster to fit because it does not rely on the latent multivariate normal parameterization. Before inspecting the model further, we will fit an equivalent model using mvgam to see what syntax is needed (note that only 250 posterior samples per chain are taken to reduce the size of the model, which will make it easier to download for completing the exercises):

model3 <- mvgam(count ~ 
                  s(ndvi_lag12, k = 9) +
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test,
                samples = 250,
                trend_model = 'AR1')

Inspect the Stan code for this model to see how it differs from the brms version above:

code(model3)
Check here to see the model’s Stan code
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## real p_taus[2]; // prior precisions for parametric coefficients
## real p_coefs[2]; // prior locations for parametric coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## int<lower=0> ytimes[n, n_series]; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[8,16] S1; // mgcv smooth penalty matrix S1
## int<lower=0> n_nonmissing; // number of nonmissing observations
## int<lower=0> flat_ys[n_nonmissing]; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## int<lower=0> obs_ind[n_nonmissing]; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // latent trend AR1 terms
## vector<lower=-1.5,upper=1.5>[n_series] ar1;
## // latent trend variance parameters
## vector<lower=0>[n_series] sigma;
## // latent trends
## matrix[n, n_series] trend;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // parametric effect priors (regularised for identifiability)
## for (i in 1:2) {
## b_raw[i] ~ normal(p_coefs[i], sqrt(1 / p_taus[i]));
## }
## // prior for s(ndvi_lag12)...
## b_raw[3:10] ~ multi_normal_prec(zero[3:10],S1[1:8,1:8] * lambda[1] + S1[1:8,9:16] * lambda[2]);
## // priors for AR parameters
## ar1 ~ std_normal();
## // priors for smoothing parameters
## lambda ~ normal(10, 25);
## // priors for latent trend variance parameters
## sigma ~ exponential(2);
## // trend estimates
## trend[1, 1:n_series] ~ normal(0, sigma);
## for(s in 1:n_series){
## trend[2:n, s] ~ normal(ar1[s] * trend[1:(n - 1), s], sigma[s]);
## }
## {
## // likelihood functions
## vector[n_nonmissing] flat_trends;
## flat_trends = (to_vector(trend))[obs_ind];
## flat_ys ~ poisson_log_glm(append_col(flat_xs, flat_trends),
## 0.0,append_row(b, 1.0));
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## vector[n_series] tau;
## array[n, n_series] int ypred;
## rho = log(lambda);
## for (s in 1:n_series) {
## tau[s] = pow(sigma[s], -2.0);
## }
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){ 
## mus[1:n, s] = eta[ytimes[1:n, s]] + trend[1:n, s];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }


You may notice a separate sub-model for the latent trend component, which evolves as an AR1 process. The line stating trend[2:n, s] ~ normal(ar1[s] * trend[1:(n - 1), s], sigma[s]); takes care of this autocorrelated process. Again this may look a bit daunting, but it is just a vectorized version of an AR1 model (see this vignette on autocorrelated models in the Stan user’s guide for some more details). Summarise the model to see how it now returns posterior summaries for the latent AR1 process:

summary(model3)
## GAM formula:
## count ~ s(ndvi_lag12, k = 9) + mintemp
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## AR1
## 
## N series:
## 1 
## 
## N timepoints:
## 162 
## 
## Status:
## Fitted using Stan 
## 
## GAM coefficient (beta) estimates:
##                   2.5%     50% 97.5% Rhat n.eff
## (Intercept)      2.300  2.4000 2.500 1.02   151
## mintemp          0.055  0.0640 0.073 1.01   282
## s(ndvi_lag12).1 -0.090  0.0093 0.130 1.01   603
## s(ndvi_lag12).2 -0.240  0.0017 0.220 1.02   246
## s(ndvi_lag12).3 -0.047 -0.0026 0.046 1.00   634
## s(ndvi_lag12).4 -0.096  0.0081 0.120 1.02   230
## s(ndvi_lag12).5 -0.064 -0.0049 0.051 1.02   249
## s(ndvi_lag12).6 -0.090  0.0091 0.110 1.02   221
## s(ndvi_lag12).7 -0.490  0.0760 0.720 1.02   218
## s(ndvi_lag12).8 -0.100  0.0840 0.260 1.01   343
## 
## GAM observation smoothing parameter (rho) estimates:
##                    2.5% 50% 97.5% Rhat n.eff
## s(ndvi_lag12)_rho  1.10 3.1   4.2 1.01   309
## s(ndvi_lag12)2_rho 0.99 3.3   4.2 1.01   535
## 
## Approximate significance of GAM observation smooths:
##                edf Ref.df Chi.sq p-value
## s(ndvi_lag12) 3.89   4.84   2.19    0.81
## 
## Latent trend parameter AR estimates:
##          2.5%  50% 97.5% Rhat n.eff
## ar1[1]   0.73 0.83  0.93 1.00   431
## sigma[1] 0.50 0.61  0.74 1.01   288
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 1000 iterations ended with a divergence (0%)
## 0 of 1000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior

Comparing inferences

We can compare inferences from the two models to see that they are very similar. First, plot the ndvi_lag12 smooth function from each model. In these plots, I show how to plot realizations from the posterior smooth functions so that you can view the individual ‘curves’ that the model thinks are possible. Here is code to view the brms functions:

plot(conditional_smooths(model2, smooths = 's(ndvi_lag12, k = 9)', spaghetti = TRUE, ndraws = 30,
                    resolution = 200),
     spaghetti_args = list(colour = "#A2505060"),
     rug_args = list(colour = "#A25050"),
     theme = theme_classic(),
     mean = FALSE,
     rug = TRUE)
## Warning: Ignoring unknown parameters: linewidth

And now view the mvgam functions, again using 30 random posterior draws:

plot_mvgam_smooth(model3, smooth = 's(ndvi_lag12)', realisations = TRUE,
     n_realisations = 30)

We can also view conditional smooths from mvgam using the plot_predictions function from the marginaleffects package:

plot_predictions(model3, 
                 condition = "ndvi_lag12",
                 # include the observed count values
                 # as points, and show rugs for the observed
                 # ndvi and count values on the axes
                 points = 0.5, rug = TRUE) +
  theme_classic()
## Warning: Removed 28 rows containing missing values (geom_point).

The smooth functions are similar for each model, as expected. Now for some code to compare estimates of the residual AR1 parameters. First, extract posterior draws for the AR1 parameter from each model, which can be done using the as.data.frame method:

ar1_ests <- rbind(data.frame(
  ar1 = as.data.frame(model3, variable = 'ar1[1]')$`ar1[1]`,
  model = 'mvgam'),
  data.frame(
    ar1 = as.data.frame(model2, variable = 'ar[1]')$`ar[1]`,
    model = 'brms'))

We can use ggplot2 to compare these estimates, which shows that they are also very similar:

ggplot(data = ar1_ests, aes(x = ar1, fill = model)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, col = 'white') +
  facet_grid(rows = 'model') +
  theme_bw() +
  theme(legend.position = 'none') +
  scale_fill_manual(values = c('darkblue', 'darkred')) +
  labs(x = expression(ar1[error]), y = 'Frequency') +
  xlim(c(0,1))
## Warning: Removed 4 rows containing missing values (geom_bar).

AR processes and stationarity

As you may remember from the lectures, the larger this AR1 parameter becomes (in absolute value), the more a time series tends toward nonstationarity. We can look at some example functions to investigate this behaviour in more detail here.

Click here if you would like to see the R code used to produce these simulations
# Define a function to simulate AR1 processes with a fixed error variance
simulate_ar1 = function(ar1 = 0.5, N = 50){
  # simulate the initial value of the series
  init <- rnorm(1, mean = 0, sd = 0.25)
  
  # create an empty vector to store the time series values
  states <- vector(length = N)
  
  # set the first value of the states as the initial value
  states[1] <- init
  
  # loop over remaining time points and fill in the AR1 process
  for(t in 2:N){
    states[t] <- rnorm(1, mean = ar1 * states[t - 1],
                       sd = 0.25)
  }
  
  return(states)
}

# plot a single AR1 series with an AR1 parameter = 0.5
plot(x = 1:50, y = simulate_ar1(ar1 = 0.5), 
     type = 'l',
     lwd = 2, col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', bty = 'l',
     ylim = c(-2,2),
     main = expression(AR[1]~'='~0.5))

# overlay an additional 20 possible series using different colours
for(i in 1:20){
  lines(simulate_ar1(ar1 = 0.5),
        col = sample(c("#DCBCBC",
                       "#C79999",
                       "#B97C7C",
                       "#A25050",
                       "#7C0000"), 1),
        lwd = 2)
}
box(bty = 'l', lwd = 2)


This plot shows how an AR1 process with a smallish AR1 parameter (0.5) approaches stationarity fairly quickly. What happens if we increase this parameter toward the types of values that our models have estimated? The code to produce these simulations is the same as above, except the AR1 parameter was changed to 0.95.
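We can also quantify this behaviour directly. A stationary AR1 process with innovation standard deviation \(\sigma\) has a stationary standard deviation of \(\sigma / \sqrt{1 - ar1^2}\), which grows without bound as the AR1 parameter approaches 1. A quick check using the innovation sd of 0.25 from the simulations above, for AR1 values of 0.5, 0.83 (roughly the posterior median our models estimated) and 0.95:

# stationary sd of an AR1 process = innovation sd / sqrt(1 - ar1^2)
ar1_vals <- c(0.5, 0.83, 0.95)
round(0.25 / sqrt(1 - ar1_vals^2), 3)
## [1] 0.289 0.448 0.801

The series with larger AR1 parameters settle around much wider stationary distributions, and take correspondingly longer to converge toward them.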

The variance of this particular series grows into the future for much longer than did the series we simulated above. This is useful for getting a sense of the relative stability of the underlying dynamic process. We can instead enforce a Random Walk so that the dynamic process will not be stationary (note that a Random Walk trend can currently only be fitted in mvgam; as far as I’m aware, there is no way to use such a process in brms):

model3b <- mvgam(count ~ 
                  s(ndvi_lag12, k = 9) +
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test,
                trend_model = 'RW')

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = f(\boldsymbol{ndvi})_t + \beta_{mintemp} * \boldsymbol{mintemp}_t + z_t \\ z_t & \sim \text{Normal}(z_{t-1}, \sigma_{error}) \\ \sigma_{error} & \sim \text{Exponential}(2) \\ f(\boldsymbol{ndvi}) & = \sum_{k=1}^{K}b_k * \beta_{smooth_k} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{mintemp} & \sim \text{Normal}(0, 1) \end{align*}\]

Note the similarity to model3 above. The only difference is that we no longer estimate the AR1 parameter, so the mean of the latent process at time \(t\) is simply the value of the latent process at time \(t-1\). Plotting the trend estimates for the AR1 and Random Walk models gives you further insight into the different assumptions they make about the temporal evolution of the system:

layout(matrix(1:2, nrow = 2))
plot(model3, type = 'trend', newdata = data_test, main = 'AR1 trend')
plot(model3b, type = 'trend', newdata = data_test, main = 'RW trend')

Forecasting with brms

Although we have the posterior estimates, computing forecasts from a brms model takes a bit of work. We need to compute posterior predictions for both the fitted data and the out-of-sample data. The problem is that the fitted model ignored all rows in which the outcome was NA, so we first predict for the non-missing values to generate the true time series predictions for those observations. We then use a slight hack to predict for the missing training observations before combining the two sets. This is not technically correct, as the values preceding the missing observations are not used to inform them, but it is close enough to get a sense of how the model fitted the data. We set the probs argument to ensure we include appropriate posterior empirical quantiles for visualizing prediction uncertainty. The robust = TRUE argument ensures that the predict.brmsfit function returns posterior medians rather than posterior means for the point estimates. See ?predict.brmsfit and ?prepare_predictions.brmsfit for more details.

hindcasts <- predict(model2, 
                     probs = c(0.05, 0.2, 0.8, 0.95),
                     robust = TRUE)
head(hindcasts)
##      Estimate Est.Error Q5 Q20 Q80   Q95
## [1,]        3    2.9652  0   1   5  8.00
## [2,]       10    4.4478  4   7  14 18.05
## [3,]       15    5.9304  8  11  20 25.00
## [4,]       21    5.9304 12  16  27 33.00
## [5,]       27    7.4130 16  21  33 40.05
## [6,]       31    7.4130 20  25  38 46.00

Compute the predictions for the missing observations, which will unfortunately ignore the temporal structure of the data:

na_hindcasts <- predict(model2,
                        probs = c(0.05, 0.2, 0.8, 0.95),
                        newdata = data_train[which(is.na(data_train$count)),],
                        robust = TRUE)
all_hindcasts <- matrix(NA, nrow = NROW(data_train),
                        ncol = NCOL(na_hindcasts))
all_hindcasts[which(is.na(data_train$count)),] <- na_hindcasts
all_hindcasts[-which(is.na(data_train$count)),] <- hindcasts

Now compute the predictions for the out-of-sample observations

forecasts <- rbind(all_hindcasts,
                   predict(model2, 
                           newdata = data_test,
                           probs = c(0.05, 0.2, 0.8, 0.95),
                           robust = TRUE,
                           oos = 1:NROW(data_test)))

These predictions can now be plotted using the same kind of plotting routines that mvgam uses (though you can use whatever plotting routines you like):

plot(model_data_trimmed$count,
     ylim = c(0, 250),
     pch = 16,
     col = 'white',
     bty = 'l',
     ylab = 'Posterior predictions',
     xlab = 'Time')
# overlay polygons of empirical quantiles
polygon(c(1:max(model_data_trimmed$time), 
          rev(1:max(model_data_trimmed$time))), 
        c(forecasts[,3], 
          rev(forecasts[,6])),
        col = "#DCBCBC", border = NA)
polygon(c(1:max(model_data_trimmed$time), 
          rev(1:max(model_data_trimmed$time))), 
        c(forecasts[,4], 
          rev(forecasts[,5])),
        col = "#B97C7C", border = NA)
# add median predictions as a line
lines(1:max(model_data_trimmed$time),
      forecasts[,1], lwd = 2, col = "#7C0000")
# overlay the observed in-sample and out-of-sample counts
points(model_data_trimmed$count, pch = 16, cex = 0.8,
       col = 'white')
points(model_data_trimmed$count, pch = 16, cex = 0.65)
# add a dashed line to mark the end of training data
abline(v = max(data_train$time), lty = 'dashed', lwd = 2)

As you can see, it takes some work to compute both hindcasts and forecasts from this particular model in brms. mvgam takes care of all of this automatically, making it easier to work with for latent autoregressive models. But as we will see in the next section, we can opt for different kinds of dynamic processes for which brms is very well suited.
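For comparison, the equivalent hindcast-and-forecast plot from the mvgam model is a single call (a sketch, assuming model3 from above is still in the working session):

# mvgam computes hindcasts automatically when fitting, and
# supplying the test data returns the out-of-sample forecasts
plot(model3, type = 'forecast', newdata = data_test)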

Exercises

  1. Compare estimates for the latent residual error terms (\(\sigma_{error}\)) from each model. In mvgam, this parameter is called sigma[1], while in brms, it is called sderr
  2. Compare estimates for the parametric effect of minimum temperature (\(\beta_{mintemp}\)) from each model. In mvgam, this parameter is called mintemp, while in brms, it is called b_mintemp
  3. Look at the Dunn-Smyth residuals for model 3 and provide a few comments describing what you see (use plot.mvgam with type = 'residuals'). Does this model seem to capture the relevant features of the autocorrelated observed data?
  4. Inspect posterior hindcasts and forecasts from model 3 using the steps we carried out in Tutorial 1
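As a starting point for Exercise 1, the error sd draws can be extracted and combined in the same way we compared the AR1 parameters above (a sketch using the parameter names stated in the exercise):

# extract posterior draws of the latent error sd from each model;
# in mvgam the parameter is 'sigma[1]', in brms it is 'sderr'
sigma_ests <- rbind(data.frame(
  sigma = as.data.frame(model3, variable = 'sigma[1]')$`sigma[1]`,
  model = 'mvgam'),
  data.frame(
    sigma = as.data.frame(model2, variable = 'sderr')$sderr,
    model = 'brms'))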

Gaussian Processes

Another way we can simultaneously capture dynamic errors in ecological time series is by using a squared exponential Gaussian Process to model the latent temporal dynamics:

“A Gaussian process defines a probability distribution over functions; in other words every sample from a Gaussian Process is an entire function from the covariate space X to the real-valued output space.” (Betancourt; Robust Gaussian Process Modeling)

In many applied time series analyses, we believe the informativeness of past observations for explaining current data can be represented as a function of how long ago those past observations were made. Gaussian Processes (GPs) are modelled as multivariate normal distributions with specific covariance functions (often called ‘kernels’) that define the types of correlations the distribution can produce. We can obtain stationary covariance functions whose correlations depend on the difference, in time, between two observations. For our purposes we will rely on the squared exponential kernel, which depends on \(\rho\) (often called the length scale parameter) to control how quickly the correlations between the model’s errors decay as a function of time. For these models, covariance decays exponentially fast with the squared distance (in time) between the observations. The functions also depend on a parameter \(\alpha\), which controls the marginal variability of the temporal function at all points; in other words it controls how much the GP term contributes to the linear predictor. Below we simulate from such a GP so that you can get an idea of the kinds of functional shapes that can be produced as you vary the length scale parameter \(\rho\). To visualize the kinds of functions that are possible under this covariance kernel, we make another spaghetti plot:

Click here if you would like to see the R code used to produce these simulations
# set a seed for reproducibility
set.seed(2222)

# simulate from a fairly large length-scale parameter
rho <- 10

# set the alpha parameter to 1 for all simulations
alpha <- 1

# simulate functions that span 50 time points
times <- 1:50

# generate a sequence of prediction values
draw_seq <- seq(min(times), max(times), length.out = 100)

# construct the temporal covariance matrix
Sigma <- alpha^2 * 
  exp(-0.5 * ((outer(draw_seq, draw_seq, "-") / rho) ^ 2)) +
  diag(1e-12, length(draw_seq))

# draw a single realization from the GP distribution
gp_vals <- MASS::mvrnorm(n = 1, 
                         mu = rep(0, length(draw_seq)),
                         Sigma = Sigma)

# plot the single GP draw
plot(x = draw_seq, y = gp_vals, type = 'l',
     lwd = 2, col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', bty = 'l',
     ylim = c(-3,3),
     main = expression(alpha~'='~1*'; '~rho~'='~10))
# overlay an additional 10 possible functions using different colours
for(i in 1:10){
  draw <- MASS::mvrnorm(n = 1, mu = rep(0, length(draw_seq)),
                        Sigma = Sigma)
  lines(x = draw_seq, y = draw, lwd = 3.5, col = 'white')
  lines(x = draw_seq, y = draw,
        col = sample(c("#DCBCBC",
                       "#C79999",
                       "#B97C7C",
                       "#A25050",
                       "#7C0000"), 1),
        lwd = 2.5)
}
box(bty = 'l', lwd = 2)


This plot shows that there are clearly nonlinear functions supported by the GP distribution. It also shows that the temporal autocorrelations have a long ‘memory’, i.e. they tend to change slowly and smoothly over time. In contrast, let’s see what happens when we simulate from a GP with a shorter length scale parameter (the same code as above was used, except the length scale was changed to 3).

These functions are much ‘wigglier’ and can be useful for capturing temporal dynamics with short-term ‘memory’. The fact that GPs are so flexible and able to capture such a wide range of autocorrelation dynamics makes them a useful option in time series modelling. Although GPs may look similar to splines in the functions they return, they are fundamentally different. In particular, a spline has only local knowledge, meaning it has no principled way of understanding how the function itself evolves across time. A GP, in contrast, directly models how the function changes over time, which makes it better able to generate out-of-sample predictions.
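The speed of this decay can be computed directly from the kernel: the correlation between two GP function values separated by a time difference t is exp(-0.5 * (t / rho)^2). Comparing the two length scales used in the simulations above:

# correlation implied by the squared exponential kernel at
# increasing time differences, for the two length scales above
lags <- c(1, 5, 10, 20)
round(exp(-0.5 * (lags / 10)^2), 3) # rho = 10; correlations decay slowly
round(exp(-0.5 * (lags / 3)^2), 3)  # rho = 3; correlations decay quickly

With rho = 10, observations 20 time steps apart remain noticeably correlated, while with rho = 3 the correlation is essentially zero beyond about 10 time steps.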

Priors for GP functions

The length scale parameter \(\rho\) is fundamentally a very challenging parameter to identify, particularly when estimating a latent Gaussian Process. For both the mvgam and brms models, our estimates of this parameter are very wide:

gp_rhos <- rbind(data.frame(
  rho = as.data.frame(model5, variable = 'rho_gp[1]')$`rho_gp[1]`,
  model = 'mvgam'),
  data.frame(
    rho = as.data.frame(model4, variable = 'lscale_gptime')$lscale_gptime,
    model = 'brms'))

ggplot(data = gp_rhos, aes(x = rho, fill = model)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, col = 'white') +
  facet_grid(rows = 'model') +
  theme_bw() +
  theme(legend.position = 'none') +
  scale_fill_manual(values = c('darkblue', 'darkred')) +
  labs(x = expression(rho[GP]), y = 'Frequency') +
  xlim(c(0,24))
## Warning: Removed 4 rows containing missing values (geom_bar).

This doesn’t cause issues in our example, but sometimes this lack of identifiability can cause problems. Imagine a scenario where a very short length scale leads to overly wiggly functions that compete with other temporal processes we’d like to estimate. In practice it can often be useful to place a fairly restrictive prior that imposes the kind of smoothness we might expect to see. We haven’t spent much time on priors so far, and indeed that won’t be a big part of this introductory workshop. But if you plan to employ GPs, it is worth taking a small amount of time to look into prior choices for length scale parameters. In both mvgam and brms, the \(\rho\) parameter must be non-negative. If we want the trend to evolve smoothly, we could place a tight normal prior centred on the type of length scale we’d expect. Let’s try a model where we impose a prior that pulls the length scale parameter toward 28 months. In mvgam, this can be done by first creating a data.frame of all parameters whose default priors we are allowed to modify:

priors <- get_mvgam_priors(count ~ 
                             s(ndvi_lag12, k = 9) +
                             mintemp,
                           family = nb(),
                           data = data_train,
                           trend_model = 'GP')
priors
##                            param_name param_length
## 1                         (Intercept)            1
## 2                             mintemp            1
## 3       vector<lower=0>[n_sp] lambda;            1
## 4 vector<lower=0>[n_series] alpha_gp;            1
## 5   vector<lower=0>[n_series] rho_gp;            1
## 6          int<lower=1> num_gp_basis;            1
## 7  vector<lower=0>[n_series] phi_inv;            1
##                           param_info                                   prior
## 1                        (Intercept)   (Intercept) ~ student_t(3, 2.6, 2.5);
## 2               mintemp fixed effect           mintemp ~ student_t(3, 0, 2);
## 3    s(ndvi_lag12) smooth parameters                lambda ~ normal(10, 25);
## 4                    trend amplitude              alpha_gp ~ normal(0, 0.5);
## 5                 trend length scale rho_gp ~ inv_gamma(1.499007, 5.670433);
## 6 basis dimension for approximate GP              num_gp_basis = min(20, n);
## 7          inverse of NB dispsersion         phi_inv ~ student_t(3, 0, 0.1);
##                   example_change new_lowerbound new_upperbound
## 1    (Intercept) ~ normal(0, 1);             NA             NA
## 2        mintemp ~ normal(0, 1);             NA             NA
## 3    lambda ~ exponential(0.27);             NA             NA
## 4 alpha_gp ~ normal(0.26, 0.74);             NA             NA
## 5    rho_gp ~ exponential(0.72);             NA             NA
## 6             num_gp_basis = 12;             NA             NA
## 7 phi_inv ~ normal(-0.88, 0.92);             NA             NA

Now we need to replace the prior slot for the relevant parameter (rho_gp) with our chosen prior distribution (see ?get_mvgam_priors for more information). The lower bound of 0 will still be respected, so we don’t have to worry about providing any truncation restrictions, but it is important to use syntactically correct Stan code when updating priors in mvgam (i.e. we need to use the correct distribution name and we must include the semicolon at the end of the line):

priors$prior[3] <- 'rho_gp ~ normal(28, 2);'
priors
##                            param_name param_length
## 1                         (Intercept)            1
## 2                             mintemp            1
## 3       vector<lower=0>[n_sp] lambda;            1
## 4 vector<lower=0>[n_series] alpha_gp;            1
## 5   vector<lower=0>[n_series] rho_gp;            1
## 6          int<lower=1> num_gp_basis;            1
## 7  vector<lower=0>[n_series] phi_inv;            1
##                           param_info                                   prior
## 1                        (Intercept)   (Intercept) ~ student_t(3, 2.6, 2.5);
## 2               mintemp fixed effect           mintemp ~ student_t(3, 0, 2);
## 3    s(ndvi_lag12) smooth parameters                 rho_gp ~ normal(28, 2);
## 4                    trend amplitude              alpha_gp ~ normal(0, 0.5);
## 5                 trend length scale rho_gp ~ inv_gamma(1.499007, 5.670433);
## 6 basis dimension for approximate GP              num_gp_basis = min(20, n);
## 7          inverse of NB dispsersion         phi_inv ~ student_t(3, 0, 0.1);
##                   example_change new_lowerbound new_upperbound
## 1    (Intercept) ~ normal(0, 1);             NA             NA
## 2        mintemp ~ normal(0, 1);             NA             NA
## 3    lambda ~ exponential(0.27);             NA             NA
## 4 alpha_gp ~ normal(0.26, 0.74);             NA             NA
## 5    rho_gp ~ exponential(0.72);             NA             NA
## 6             num_gp_basis = 12;             NA             NA
## 7 phi_inv ~ normal(-0.88, 0.92);             NA             NA

Now that we have updated the prior for \(\rho\), let’s see whether this makes any impact on our inferences by re-running the model and supplying the priors object (use ?set_prior for advice on how to do the same in brms). Again we take only 250 posterior samples per chain to reduce the model’s size for the exercises:

model6 <- mvgam(count ~ 
                  s(ndvi_lag12, k = 9) +
                  mintemp,
                family = nb(),
                data = data_train,
                newdata = data_test,
                # include the priors dataframe to 
                # update our rho_gp prior
                priors = priors,
                samples = 250,
                trend_model = 'GP')

The posterior estimates from this model show that the rho_gp parameter has been successfully constrained, confirming that the prior is indeed influential in informing these estimates:

gp_rhos <- rbind(data.frame(
  rho = as.data.frame(model5, variable = 'rho_gp[1]')$`rho_gp[1]`,
  model = 'model5'),
  data.frame(
    rho = as.data.frame(model6, variable = 'rho_gp[1]')$`rho_gp[1]`,
    model = 'model6'))

ggplot(data = gp_rhos, aes(x = rho, fill = model)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, col = 'white') +
  facet_grid(rows = 'model') +
  theme_bw() +
  theme(legend.position = 'none') +
  scale_fill_manual(values = c('darkblue', 'darkred')) +
  labs(x = expression(rho[GP]), y = 'Frequency') +
  xlim(c(0,35))
## Warning: Removed 4 rows containing missing values (geom_bar).

A plot of the resulting temporal trend shows a slightly smoother set of functions, but overall the inferences are the same:

plot(model6, type = 'trend', newdata = data_test)

GP covariance estimates

This difference in time-to-convergence of the dynamic process is related to the stationarity (or stability) of the system. A final and practically useful way to interrogate GP functions is to plot how covariance is expected to change over time. Those of you who have worked with variograms or semi-variograms when analysing spatial data will be used to visualizing how semivariance decays over distance. Here we will visualize how covariance decays over temporal distance (i.e. how far apart in time do two points need to be before the GP estimate at one no longer depends on the other?). We can define a few helper functions to generate these plots, which will accept posterior estimates of GP parameters from a fitted mvgam object, compute GP covariance kernels, calculate posterior quantiles and then plot them:

Click here if you would like to see the R function plot_kernels, used to plot GP kernels
# Compute the covariance kernel for a given draw
# of GP alpha and rho parameters
quad_kernel = function(rho, alpha, times){
  covs <- alpha ^ 2 * 
    exp(-0.5 * ((times / rho) ^ 2))
  return(covs)
}

# Compute kernels for each posterior draw
plot_kernels = function(gp_ests, max_time = 50){
  # set up an empty matrix to store the kernel estimates 
  # for each posterior draw
  draw_seq <- seq(0, max_time, length.out = 100)
  kernels <- matrix(NA, nrow = NROW(gp_ests), ncol = 100)
  
  # use a for-loop to estimate the kernel for each draw using
  # our custom quad_kernel() function
  for(i in 1:NROW(gp_ests)){
    kernels[i,] <- quad_kernel(rho = gp_ests$`rho_gp[1]`[i],
                               alpha = gp_ests$`alpha_gp[1]`[i],
                               times = draw_seq)
  }
  
  # Calculate posterior empirical quantiles of the kernels
  probs <- c(0.05, 0.2, 0.5, 0.8, 0.95)
  cred <- sapply(1:NCOL(kernels),
                 function(n) quantile(kernels[,n],
                                      probs = probs))
  
  # Plot the kernel uncertainty estimates
  pred_vals <- draw_seq
  plot(1, xlim = c(0, max_time), ylim = c(0, max(cred)), type = 'n',
       xlab = 'Time difference', ylab = 'Covariance',
       bty = 'L')
  box(bty = 'L', lwd = 2)
  polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[5,])),
          col = "#DCBCBC", border = NA)
  polygon(c(pred_vals, rev(pred_vals)), c(cred[2,], rev(cred[4,])),
          col = "#B97C7C", border = NA)
  lines(pred_vals, cred[3,], col = "#7C0000", lwd = 2.5)
}


Extract posterior estimates of the GP parameters using the as.data.frame function that we’ve been using previously. You can then plot the posterior kernel estimates for model 5 to visualize how temporal error covariance is estimated to change as two time points are further apart. To do this, all we need to supply is the posterior estimates for the GP parameters in the form of a data.frame:

gp_ests <- as.data.frame(model5, variable = c('rho_gp[1]',
                                              'alpha_gp[1]'))
plot_kernels(gp_ests = gp_ests, max_time = 40)

This can be helpful for understanding how long the temporal ‘memory’ of a given GP is, which also relates to how smooth its underlying functions are.

Exercises

  1. Plot the 1st derivative for the GP trend from model 6 and describe how it differs from that of model 5 above (hint: pay attention to how quickly they converge toward flat functions at the end of the forecast horizon). Take a few notes describing how these GP predictions differ from the spline extrapolations from model0 and model0b
  2. Plot the GP covariance kernel for model 6 and compare its shape to the kernels we estimated from model 5 above
  3. Fit an mvgam model that uses a GP trend and a Poisson observation model in place of the Negative Binomial. How do predictions from this model differ?
  4. Consider watching the lecture below by Richard McElreath on Gaussian Processes and their wide uses in statistical modelling

Session Info

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.6          tidybayes_3.0.2        mvgam_1.0.5           
##  [4] insight_0.19.3         marginaleffects_0.13.0 brms_2.19.0           
##  [7] Rcpp_1.0.10            mgcv_1.8-41            nlme_3.1-157          
## [10] dplyr_1.0.10           downloadthis_0.3.2    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-3     ellipsis_0.3.2       ggridges_0.5.3      
##   [4] markdown_1.1         base64enc_0.1-3      fs_1.6.1            
##   [7] rstudioapi_0.14      farver_2.1.1         rstan_2.26.13       
##  [10] svUnit_1.0.6         DT_0.26              fansi_1.0.3         
##  [13] mvtnorm_1.1-3        lubridate_1.8.0      bridgesampling_1.1-2
##  [16] codetools_0.2-18     splines_4.2.1        cachem_1.0.7        
##  [19] knitr_1.42           shinythemes_1.2.0    bayesplot_1.9.0     
##  [22] jsonlite_1.8.4       bsplus_0.1.4         ggdist_3.2.0        
##  [25] shiny_1.7.2          compiler_4.2.1       backports_1.4.1     
##  [28] assertthat_0.2.1     Matrix_1.5-1         fastmap_1.1.0       
##  [31] cli_3.4.0            later_1.3.0          htmltools_0.5.5     
##  [34] prettyunits_1.1.1    tools_4.2.1          igraph_1.3.4        
##  [37] coda_0.19-4          gtable_0.3.1         glue_1.6.2          
##  [40] reshape2_1.4.4       posterior_1.3.1      V8_4.2.1            
##  [43] jquerylib_0.1.4      vctrs_0.6.2          crosstalk_1.2.0     
##  [46] tensorA_0.36.2       xfun_0.40            stringr_1.5.0       
##  [49] ps_1.7.1             collapse_1.9.6       mime_0.12           
##  [52] miniUI_0.1.1.1       lifecycle_1.0.3      gtools_3.9.3        
##  [55] klippy_0.0.0.9500    MASS_7.3-57          zoo_1.8-11          
##  [58] scales_1.2.1         colourpicker_1.2.0   promises_1.2.0.1    
##  [61] Brobdingnag_1.2-9    parallel_4.2.1       inline_0.3.19       
##  [64] shinystan_2.6.0      yaml_2.3.6           curl_4.3.2          
##  [67] gridExtra_2.3        loo_2.5.1            StanHeaders_2.26.13 
##  [70] sass_0.4.5           stringi_1.7.12       highr_0.10          
##  [73] dygraphs_1.1.1.6     checkmate_2.1.0      pkgbuild_1.3.1      
##  [76] zip_2.2.2            cmdstanr_0.5.3       rlang_1.1.0         
##  [79] pkgconfig_2.0.3      matrixStats_0.62.0   distributional_0.3.1
##  [82] evaluate_0.20        lattice_0.20-45      purrr_0.3.4         
##  [85] labeling_0.4.2       rstantools_2.2.0     htmlwidgets_1.5.4   
##  [88] tidyselect_1.2.0     processx_3.7.0       plyr_1.8.7          
##  [91] magrittr_2.0.3       R6_2.5.1             generics_0.1.3      
##  [94] DBI_1.1.3            withr_2.5.0          pillar_1.8.1        
##  [97] xts_0.12.2           abind_1.4-5          tibble_3.1.8        
## [100] crayon_1.5.2         arrayhelpers_1.1-0   utf8_1.2.2          
## [103] rmarkdown_2.21       grid_4.2.1           data.table_1.14.4   
## [106] callr_3.7.3          threejs_0.3.3        digest_0.6.29       
## [109] xtable_1.8-4         tidyr_1.2.1          httpuv_1.6.9        
## [112] scoringRules_1.0.2   RcppParallel_5.1.5   stats4_4.2.1        
## [115] munsell_0.5.0        bslib_0.4.2          shinyjs_2.1.0